Our group topic is cancer genomics, and we worked with the curatedBladderData data set for this project, specifically attempting to mimic results from this paper. To give you a little background on why we care about bladder cancer, we have some info below from the American Cancer Society to bring it into focus.
The American Cancer Society estimates that there will be around 80,000 new cases of bladder cancer in the US in 2017, constituting approximately 5% of all new cancer cases. Of these, men are expected to be affected at 3 times the rate of women, and there are about 17,000 total deaths expected.
Bladder cancer tumors can be divided into two groups: superficial and muscle-invasive. This relates to staging of cancer (0/a, 1, 2, 3, 4) as follows. Generally, the 0 or a stage & 1 are considered superficial tumors because they haven’t yet invaded the muscular wall of the bladder; stage 2 is characterized by the tumor invading the muscle wall of the bladder, and so stage 2 and beyond are classified as muscle-invasive tumors. Around half of all bladder cancers are first detected before becoming muscle-invasive.
The particular data set we chose to use in the curatedBladderData collection contains gene expression of 6225 genes for each of 80 subjects with bladder tumors. Some covariates of interest that we make use of later on are tumor subtype (superficial, muscle-invasive) and tumor stage.
As we began working with the data set, it became apparent that there was an abundance of missing data for gene expression. In figuring out how to deal with this, we plotted gene expression standard deviation against the number of subjects with missing values for each gene.
As you can see, there is still a lot of variance in some genes with a lot of missing data, so we don’t want to throw this info away. Hence, we made the decision to use median imputation to impute gene expression across genes for the remainder of our project.
The 80 bladder tumors were obtained frozen from the tissue bank of the UCSF Comprehensive Cancer Center, and the tumors were staged according to professionals from the American Joint Committee on Cancer. Of note, the paper mentioned above claims that the 80 tumors can be broken up into 17 stage a, 10 stage 1, 15 stage 2, 25 stage 3, and 13 stage 4. However, upon examining the data ourselves, we believe this to be a typo in the paper or there may be an error present in the data set. From the data set we used, we have that there are 17 stage a, 10 stage 1, 14 stage 2, 26 stage 3, and 13 stage 4, and this is the data we will use for the remainder of our project.
We have 3 main objectives that we will cover throughout this project:
Firstly, we attempt to recreate a cluster dendrogram to resemble Fig 1 in the paper mentioned above. Several methods for clustering will be explored and compared to Fig 1 of the paper.
Secondly, we investigate differences in gene expression levels for superficial and muscle-invasive tumors using heatmaps. In the paper, they created a heatmap (Fig 2) using an unspecified subset of around 2k genes (same set they used to create their cluster dendrogram). We didn’t have a way to find out what these genes were, so we created several of our own subsets and generated heatmaps and compared them.
Thirdly, we fit our subsets of genes using various methods to predict, based on expression, whether each tumor should be classified as superficial or muscle-invasive. We then used 5-fold cross-validation to obtain a mis-classification rate for each combination of fitting method and gene subset. To try and get a more accurate measurement of the true mis-classification rate, we run 10 cross-validations and take the average error across the 10 iterations.
Signal intensity values of each element were extracted using the GenePix software.
Intensities were then corrected by subarray median centering followed by global locally weighted scatterplot smoothing correction
Values are normalized and median centered
In this section, we attempt to recreate Figure 1 in the paper with a few differences described below.
Filtering of genes: log2 ratios were filtered to include only these clones showing expression ratios in at least 80% of samples analyzed Only clones that had |log2 ratio|>2 in at least one sample were used
Unsupervised hierarchical clustering
Linkage method: Average linkage (join clusters whose avg distance are the smallest)
Distance/Similarity matrix: uncentered Pearson correlation metric
Bottom-up clustering direction
We cannot perform the filtering as described by the paper, hence we chose a few other ways from class we can try to reduce the number of dimensions. We also explore other methods of clustering in order to examine if the choices about the clustering algorithm with regards to imputation, filtering, and clustering method made by the author make a difference in the validity of the result. In other words, we examine if the validity of the claim in the paper is sensitive to these choices.
We try three different approaches of clustering with different choices of how we select the genes to include, and the imputation. We also use a slightly different clustering algorithm from the paper. We chose hierarchical clustering with ward’s method.
Filtering: we just select top 100 genes with highest variance (before imputing);
Imputation: Median impute
Non-missing genes only (621 genes left);
10 principal components
Median Impute ; 500 random genes
Number 0 to 4 denote the stage of tumor. The red color in the colored bar at the bottom denotes superficial tumor (stage 0 and stage 1) while blue denotes the muscle invasive tumor. ‘tcc’ and ‘squ’ refers to the histological type.
##
## cluster invasive superficial
## 1 48 1
## 2 5 26
##
## cluster squamous tcc
## 1 6 43
## 2 0 31
##
## cluster 0 1 2 3 4
## 1 1 0 10 25 13
## 2 16 10 4 1 0
We see here that the two clusters are distinct in terms of the proportion of superficial and muscle-invasive tumor. we only have 6 misclassifications (5 in superficial and 1 in muscle invasive). Note also that all the ‘squamous’ histological type tumor are in the same cluster. This is consistent with the results reported by the paper which also have 6 similar misclassifications. This shows that at least in this example, the result is not sensitive to the choice of filtering, imputation, and clustering algorithm. We can still clearly distinguish the superficial and the muscle invasive tumor based on the unsupervised cluster.
##
## cluster invasive superficial
## 1 49 2
## 2 4 25
Here we also have 6 misclassifications (4 in superficial and 2 in muscle invasive). This shows that even with just 10 principle components based on the 621 non-missing genes compared to the 2,675 genes the paper uses after filtering, we can still have two distinct clusters with the same misclassifications.
An example of a hierarchical cluster of 500 random genes. We can still see that there are two distinct clusters.
The histogram shows that 80 percent of the times when we randomly select 500 genes, we still obtain less than 10 misclassifications. This shows that there might be some batch effect across the genes between the two types of tumors because even when we randomly pick genes, we can still differentiate the two types of tumor. Another possible explanation is that there are signals in a lot of these genes and therefore even when we randomly pick, we are still getting some signals.
To try and determine different genes that are important in distinguishing the difference between the superficial and muscle-invasive tumors, we use the cross-validation procedure described above to find certain sets of genes. With 5-fold cross-validation repeated 10 times, we have 10 potential sets of “important” genes. Within each run, we can select genes that are considered differntially expressed once, twice, or any number up to all five times.
We try this below and see some interesting items. It should be noted that in these plots, there may be some genes with extreme values, which will cause the general pattern to become faded. This flaw unfortunately was unable to be addressed at this time.
Firstly, we look at the sets of genes selected as differntially expressed with a p-value cutoff of 0.000001. We consider three different subsets of the selected genes based upon the number of times the gene was considered differntially expressed: 1, 2, 5. It should be mentioned again that occurring 5 times means that any gene that was included in all 5 folds of any of the 10 repetitions will be included.
We notice that a more distinct patter seems to appear as we limit the number of genes that are included. This may be due to some genes simply being noise in the larger sets, or may also be due to some washing out of gene expression levels due to higher scales of some included genes (as can be seen in the legend).
We look further and consider a couple of graphs where the cutoff for the p-value is not as extreme.
Here we again see a similar story to what was seen with the more extreme p-value cut-off.
Finally, we consider a random selection of genes to see if we see a distinct pattern between the two tumor types. If this is the case, then we either would need to conclude that all of the genes are impacted by the tumor type (which may be indicative of a potential batch effect), or that even a random sample is good at differentiating between the two tumor types. We randomly select 50 genes for this map. For a good comparison, we select the 50 “most significant” differntially expressed genes and create another heatmap to see if there is a difference between the two. In either, we use any gene selected in any fold, giving us ~250 randomly selected genes in the random heat map (possibly less if the same gene was selected in more than one fold).
We do not see a clear distinction in the randomly selected genes. Rather, we see a fairly even color scheme across all subjects. However, we do see a good distinction in the most significant genes. Thus, it does appear that modeling and the analysis are worthwhile in this pursuit.
Finally, we consider the predictive ability of several methods using several approaches to selecting genes. Here, we use Random Forests, GLM Boosting, and Support Vector Machines. Within these, we use a selection procedure as used above on the heatmaps to determine which genes should be used in the model. Namely, we select all genes in a fold considered differntially expressed (p-value less than 0.000001, and 0.00001), the 50 genes with the smallest p-values, and 50 randomly sampled genes. We can see in Table the results.
| p<0.000001 | p<0.00001 | Top 50 | Random 50 | |
|---|---|---|---|---|
| Random Forest | 0.0877843 | 0.0753088 | 0.0860343 | 0.1561814 |
| GLM Boost | 0.0950000 | 0.0955441 | 0.0972500 | 0.1271275 |
| SVM | 0.0851716 | 0.0819265 | 0.1042255 | 0.1772647 |
We see the best results using Random Forests with the slightly larger set of genes (p < 0.00001). Random Forests does perform the best with the smallest set of genes (Top 50), and all methods perform worst using the random draw. However, we do see fairly remarkable accuracy with the random draw, which might again indicate a potential batch effect in the data. Or it may be that a single gene that is differntially expressed is randomly included and does manage to provide enough information for these methods to make some distinction.
Of additional note, GLM Boosting appears to perform similarly well across all selections of related genes, SVM performs better with a larger set of genes, and performs the worst with the random draw.
Thus, we can see that many methods exist that can be used to determine differences between superficial and muscle-invasive tumors. While remarkable, the random sampling method does not perform as well as other methods, indicating that information is available in the gene expressions, and differences do exist between the tumor types.